Abstract

The Histogram of Oriented Gradients (HOG) descriptor has led to many
advances in computer vision over the last decade and is still part of
many state-of-the-art approaches. We observe that the associated feature
computation is piecewise differentiable, and therefore many pipelines
that build on HOG can be made differentiable. This enables advanced
introspection as well as opportunities for end-to-end optimization. We
present our implementation of VHOG based on the auto-differentiation
toolbox Chumpy [18] and show applications to pre-image visualization and
to pose estimation, the latter extending the existing differentiable
renderer OpenDR [19] pipeline. Both applications improve on the
respective state-of-the-art HOG approaches.


1. Introduction 

Since its original presentation, the Histogram of Oriented Gradients
(HOG) descriptor [4] has seen many use cases beyond its initial target
application of pedestrian detection. Most prominently, it is a core
building block of the widely used Deformable Part Model (DPM) object
class detector [9] and of exemplar models [23], both of which have seen
many follow-up approaches. Most recently, HOG-based approaches have
repeatedly shown good generalization to rendered [1] and artistic images
[2], while this type of generalization is non-trivial to achieve with
the recently very successful deep learning models in vision [24].

Like all feature representations, HOG seeks a reduction of information
in order to arrive at a more compact representation of the visual input
that is more robust to nuisances such as noise and illumination. It is
specified as a mapping of an image into the HOG space. The resulting
representation is then typically used in classification or matching
approaches to solve computer vision tasks.

While HOG is only defined as a feed-forward computation and introduces
an information bottleneck, we sometimes desire to invert this pipeline
for further analysis. For example, previous work has tried to visualize
HOG features by solving a pre-image problem [28, 13]. Given a HOG
representation of an unobserved input image, these approaches try to
estimate an image that produces the same HOG representation and is close
to the original image. This has been addressed by sampling approaches
and by approximations of the HOG computation in order to circumvent the
problem of the non-invertible HOG computation. Another example is pose
estimation based on 3D models [31, 1, 25, 27], which exploits renderings
of 3D models in order to learn a pose prediction model. Here the HOG
computation is followed by a Deformable Part Model [9] or simplified
versions that are related to the Exemplar Model [23]. Typically, these
methods employ sampling-based approaches in order to render discrete
viewpoints that are then used in a learning-based scheme to match to
images.

Figure 1: We exploit the piecewise differentiability of the popular HOG
descriptor for end-to-end optimization. The figure shows applications to
pre-image reconstruction given HOG features as well as to the pose
estimation task based on the same idea.

In our work, we investigate directly computing the gradient of the HOG
representation, which can then be used for end-to-end optimization of
the input w.r.t. the desired output. For the visualization via pre-image
estimation, we observe the HOG representation and compute the gradient
w.r.t. the raw pixels of the input image. For pose estimation, we
consider the whole pose scoring pipeline of [1], which constitutes a
model with multiple parts and a classifier on top of the HOG
representation. Here we show how to directly maximize the pose scoring
function by computing the gradient w.r.t. the pose parameters. In
contrast to the previous approach, we do not rely on exhaustively
pre-rendering views, and our pose estimation error is therefore not
limited by the initial sampling.

We compare to previous work on HOG visualization and on HOG-based pose
estimation using rendered images. By using our approach of end-to-end
optimization via differentiation of the HOG computation, we improve over
the state of the art on both tasks.

2. Related Work 

The HOG feature representation is widely used in computer vision
applications. Despite its popularity, its appearance in an objective
function usually makes the optimization problem hard to handle, because
it is treated as a non-differentiable function [12, 32]. The question of
how to invert a feature descriptor to inspect its original observation
has motivated a line of work on feature inversion and feature
visualization (the pre-image problem). Given the HOG features of a test
image, Vondrick et al. [28] tried in their baseline to optimize an
objective involving HOG using numerical derivatives but failed to get
reasonable results; in their proposed method the inversion is therefore
done by learning a paired dictionary of features and the corresponding
images. Weinzaepfel et al. [30] attempted to reconstruct the pre-image
of given SIFT descriptors [21] based on a nearest neighbor search in a
huge database of images for patches with the closest descriptors. Kato
et al. [13] study the problem of pre-image estimation for bag-of-words
features and rely on a large-scale database to optimize the spatial
arrangement of visual words. Although these and other related works
provide different ways to approximately illustrate the characteristics
of image features, we have hardly seen any work that directly addresses
the differentiable form of the feature extraction procedure. In
contrast, our approach makes the differentiation of the HOG descriptor
practical, such that it can easily be plugged into a computer vision
pipeline to enable direct end-to-end optimization and extension to
hybrid MCMC schemes [15, 16]. The work most relevant to ours is that of
Mahendran et al. [22], which inverts feature descriptors (HOG [9], SIFT
[21], and CNNs [14]) for a direct analysis of the visual information
contained in the representations, where HOG and SIFT are implemented as
Convolutional Neural Networks (CNNs). However, their approach contains
an approximation to the orientation binning stage of HOG/SIFT and
includes two strong natural image priors in the objective function, with
some parameters that need to be estimated from a training set. In our
work, in contrast, we do not use any approximation in the HOG pipeline
and no training is needed.

Although deep-learning-based features are in fashion these years, there
are still plenty of applications using HOG, in particular the Exemplar
LDA [11] for the pose estimation task with rendered/CAD data [1, 17, 3].
In [6], a slightly modified SIFT (which, like HOG, is based on gradient
histograms) can beat CNNs in a feature matching task. In this paper, we
specifically demonstrate the application of our VHOG to the pose
estimation problem of aligning 3D CAD models to objects in 2D real
images, and we briefly review some recent research here. Lim et al. [17]
assume that an accurate 3D CAD model of the target object is given;
based on a discretized space of poses for initialization, they estimate
the pose from correspondences of LDA patches between the real image and
the rendered image of the CAD model. Aubry et al. [1] create a large
dataset of CAD models of chairs; by rendering each CAD model from a
large set of viewpoints, they train classifiers of discriminative
exemplar patches in order to find the alignment between the chair in the
2D image and the most similar CAD model at a certain rendering pose. In
addition to discrete pose estimation schemes such as [1], there has been
work on continuous pose estimation [26, 3, 25]. For instance, Pepik et
al. [25] train a continuous viewpoint regressor as well as RCNN-based
[10] key-point detectors, which are used to localize key-points in 2D
images in an object-class-specific fashion; from the correspondences
between the key-points in the 2D image and on the 3D CAD model, they
estimate the pose of the target object. However, most of these
state-of-the-art approaches need to collect plenty of data to train the
discriminative visual element detectors or key-point detectors for the
matching, or need to render many images of CAD models from various
viewpoints in advance. Instead, our proposed method combines the
VHOG-based exemplar LDA model with the approximate differentiable
renderer from [19], which enables direct end-to-end optimization of the
pose parameters of the CAD model in alignment with the target object in
the real image.

3. VHOG 

Here we describe how we obtain the derivative of the HOG descriptor. The
original HOG computation consists of a few sequential key components:
1) computing gradients, 2) weighted voting into spatial and orientation
cells, and 3) contrast normalization over overlapping spatial blocks. In
our implementation we follow the same procedure. For each part we argue
for piecewise differentiability. The differentiability of the whole
pipeline then follows from the chain rule of differentiation. Figure 2
shows an overview of the computations involved in the HOG feature
computation pipeline, which we describe in detail in the following.

3.1. Gradients Computation 

If a color image $I \in \mathbb{R}^{X \times Y \times 3}$ is given, we
first compute its gray-level image:

$$I_{\text{gray}} = 0.299\, I(:,:,0) + 0.587\, I(:,:,1) + 0.114\, I(:,:,2) \qquad (1)$$

Then we follow the best setting for gradient computation in Dalal's
approach [4] and apply the discrete 1-D derivative masks $[-1, 0, 1]$ in
both the horizontal and vertical directions without Gaussian smoothing.
We denote the gradient maps in the horizontal and vertical directions as
$G_x$ and $G_y$; the magnitude $\|\nabla\|$ and direction $\Theta$ of
the gradients are computed by:

$$\|\nabla\| = \sqrt{G_x^2 + G_y^2}, \qquad \Theta = \arctan(G_y, G_x) \qquad (2)$$

Note that we use unsigned orientations, such that the elements of
$\Theta$ lie in the numerical range $[0, 180]$. Throughout this paper,
$\|\cdot\|$ denotes the L2 norm for consistency.

Differentiability: The conversion to gray as well as the derivative
computation via linear filtering are linear operations and therefore
differentiable. $\arctan$ is differentiable in $\mathbb{R}$, and the
gradient magnitude $\|\nabla\|$ is differentiable due to the chaining of
the differentiable squaring function and the square root over values in
$\mathbb{R}^+$.
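The gradient stage of Eq. 1 and 2 can be sketched in NumPy as follows. This is an illustrative sketch rather than the Chumpy implementation described in this paper; the function name `gradients` and the zero-valued border handling are assumptions.

```python
import numpy as np

def gradients(image):
    """Gray conversion (Eq. 1), [-1, 0, 1] derivative masks, and the
    magnitude and unsigned orientation in degrees (Eq. 2)."""
    image = np.asarray(image, dtype=float)
    if image.ndim == 3:
        gray = (0.299 * image[..., 0] + 0.587 * image[..., 1]
                + 0.114 * image[..., 2])
    else:
        gray = image
    gx = np.zeros_like(gray)
    gy = np.zeros_like(gray)
    # Centered differences I(x+1) - I(x-1); border pixels stay zero (assumption).
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    # Fold signed angles into the unsigned range [0, 180).
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0
    return gx, gy, magnitude, theta
```

Every step is a linear filter or an element-wise smooth function, which is what makes the stage amenable to automatic differentiation.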

3.2. Weighted vote into spatial and orientation cells 

After we have the magnitude and direction of the gradients, we proceed
with the weighted vote of gradients into spatial and orientation cells,
which provides the fundamental nonlinearity of the HOG representation.
The cells are local spatial regions in which we accumulate the local
statistics of gradients by histogram binning of their orientations.
Assume we divide the image region into a grid of cells of size
$c_w \times c_h$. For each pixel located within a cell, we compute the
weighted vote of its gradient orientation to an orientation histogram
(we use the same setting as Dalal's: a histogram of 9 bins spaced over
0° to 180°, which ignores the sign of the gradients).

Normally, the orientation histogram of each cell is represented as a 1-D
vector of length $B$ (the number of bins), but this representation
discards the positions of the pixels that contribute to the histogram.
It therefore does not lead to a formulation that allows differentiating
the HOG representation with respect to individual pixel positions. Our
main observation is that each orientation bin can be viewed as a filter
$f_b^o$ applied to each location in the gradient map. We store the
filtered results in maps $F_b^o$ of the same size as the gradient map.
The pixel-wise orientational filters $\{f_b^o\}_{b=1 \cdots B}$ are
chosen to follow the bilinear interpolation scheme of the gradients in
neighboring orientational bins:

$$f_b^o(\Theta) = \mathrm{clip}_0^1\left(1 - |\Theta - \mu_b| \times \frac{B}{180}\right) \qquad (3)$$

$$F_b^o = \|\nabla\| \odot f_b^o(\Theta), \quad \forall b \in 1 \cdots B \qquad (4)$$

where $\mu_b$ is the central orientation (in degrees) of filter $f_b^o$,
the function $\mathrm{clip}_0^1$ clamps the numerical range between 0
and 1, and $\odot$ is element-wise multiplication. (Note that for the
first filter $f_1^o$ we also take care of the numerical range. See the
visualization shown in Figure 2.)
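As a minimal sketch of Eq. 3 and 4, the orientation filters can be applied to whole gradient maps at once. The bin centers (10°, 30°, ..., 170° for 9 bins) and the circular wrap between 0° and 180° are assumptions made for this example; the function name `orientation_votes` is hypothetical.

```python
import numpy as np

def orientation_votes(magnitude, theta, num_bins=9):
    """Return maps of shape (num_bins, H, W): the gradient magnitude weighted
    by the clipped triangular response of each orientation filter."""
    bin_width = 180.0 / num_bins
    centers = (np.arange(num_bins) + 0.5) * bin_width  # assumed bin centers
    diff = np.abs(theta[None] - centers[:, None, None])
    # Unsigned orientations are circular: 0 and 180 degrees coincide (assumption).
    diff = np.minimum(diff, 180.0 - diff)
    weights = np.clip(1.0 - diff / bin_width, 0.0, 1.0)
    return magnitude[None] * weights
```

With this choice each pixel votes into at most two neighboring bins, and its two weights sum to one, so the total magnitude is preserved across bins.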

Having the maps $F_b^o$ from orientational binning, we then apply
spatial binning for each cell. As in Dalal's method, to reduce aliasing,
each pixel location contributes to its 4 neighboring cells in proportion
to its distances to the centers of those cells; in other words, the
votes are interpolated bilinearly. Following a similar trick as in
orientational binning, we create a $2c_w \times 2c_h$ bilinear filter
$f^s$ whose maximum value 1 is at the center, with values decreasing
toward the four corners to the minimum value 0, as shown in Figure 2. We
convolve it with all $F_b^o$ to get the spatially filtered results
$F_b^s$:

$$F_b^s = F_b^o * f^s, \quad \forall b \in 1 \cdots B \qquad (5)$$

The spatial binning for the cells can then simply be read off from

$$F_b^s(x, y \mid x \in \mathcal{X}, y \in \mathcal{Y}), \quad \forall b \in 1 \cdots B \qquad (6)$$

where $(\mathcal{X}, \mathcal{Y})$ are the $(x, y)$ coordinates of the
centers of all cells.

Note that up to this point, concatenating
$v = \{F_b^s(x, y \mid x \in \mathcal{X}, y \in \mathcal{Y})\}_{b=1 \cdots B}$
gives exactly the same representation as the original HOG approach.

Differentiability: By re-representing the data, we have shown that the
histogramming and voting procedure of the HOG approach can be viewed as
linear filtering operations followed by a summation. Both steps are
differentiable.
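The spatial binning of Eq. 5 and 6 can be sketched as a tent filter evaluated only at the cell centers, which is equivalent to convolving and then sampling. The helper names, the zero padding at the image border, and the restriction to even cell sizes are assumptions of this sketch. Because the filter has an even size, its taps sample the tent at half-pixel offsets, so the discrete peak value is slightly below 1.

```python
import numpy as np

def bilinear_kernel(cell_h, cell_w):
    """Separable (2*cell_h) x (2*cell_w) tent filter."""
    def tent(c):
        # Distance of each tap center from the filter center, in pixels.
        offsets = np.abs(np.arange(2 * c) + 0.5 - c)
        return 1.0 - offsets / c
    return np.outer(tent(cell_h), tent(cell_w))

def spatial_binning(votes, cell_h, cell_w):
    """votes: (B, H, W) orientation maps. Returns cell histograms of shape
    (B, H // cell_h, W // cell_w). Assumes even cell sizes."""
    num_bins, height, width = votes.shape
    kernel = bilinear_kernel(cell_h, cell_w)
    pad_h, pad_w = cell_h // 2, cell_w // 2
    # Zero padding so that windows around border cells stay in bounds.
    padded = np.pad(votes, ((0, 0), (pad_h, cell_h), (pad_w, cell_w)))
    rows, cols = height // cell_h, width // cell_w
    cells = np.zeros((num_bins, rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = padded[:, r * cell_h:(r + 2) * cell_h,
                            c * cell_w:(c + 2) * cell_w]
            cells[:, r, c] = (window * kernel).sum(axis=(1, 2))
    return cells
```

A single interior pixel spreads its vote over the four surrounding cells with weights that sum to one, which is the bilinear interpolation described above.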

3.3. Contrast normalization 

In the original procedure of Dalal's HOG descriptor, contrast
normalization is performed on every local region of size 3x3 cells,
which they call blocks. As many recent applications that we are
interested in [1, 2, 13, 28, 9] do not use blocks, we do not consider
them in our implementation either. While it is possible to incorporate
this step, it would also lead to increased computational cost due to the
multiple representations of the same cell. We instead only use a global
normalization with the robust L2 norm. Given the HOG representation $v$
from the previous steps, the global contrast normalization can be
written as:

$$v_{\text{normalized}} = \frac{v}{\sqrt{\|v\|^2 + \epsilon}} \qquad (7)$$

where $\epsilon$ is a small positive constant.

Figure 2: Visualization of the implementation procedure for our VHOG method.

Differentiability: This is a chain of differentiable functions and
therefore the whole expression is differentiable.
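To make the differentiability of the global normalization concrete, the sketch below writes out its Jacobian by hand. It assumes the normalization has the form $v / \sqrt{\|v\|^2 + \epsilon}$; the function names and the default value of `eps` are illustrative choices, not taken from our implementation.

```python
import numpy as np

def normalize(v, eps=1e-3):
    """Global robust L2 normalization (assumed form: v / sqrt(|v|^2 + eps))."""
    v = np.asarray(v, dtype=float)
    return v / np.sqrt(np.dot(v, v) + eps)

def normalize_jacobian(v, eps=1e-3):
    """Analytic Jacobian J[i, j] = d normalize(v)[i] / d v[j]."""
    v = np.asarray(v, dtype=float)
    s = np.sqrt(np.dot(v, v) + eps)
    # Quotient rule: identity / s minus the outer product scaled by 1 / s^3.
    return np.eye(v.size) / s - np.outer(v, v) / s ** 3
```

An auto-differentiation package derives the same Jacobian automatically; comparing it against central finite differences is a simple sanity check.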

Difference to Original HOG: While a large diversity of HOG
implementations is available by now, we summarize the two main
differences to the standard one proposed in [4]. First, the original HOG
computes the gradients on the different color channels and applies the
maximum operator to the magnitudes over all channels to get the gradient
map. In our implementation, we simply transform the color image into
gray scale first and compute the gradient map directly. Second, we do
not include the local contrast normalization over overlapping spatial
blocks, but we do include the global, robust L2 normalization.

3.4. Implementation

All operations in the above equations (Eq. 1, 2, 3, 5, 7) are
(piecewise) differentiable (summation, multiplication, division, square,
square root, arc-tangent, clip). By the chain rule, our overall HOG
implementation is therefore differentiable at each pixel position. This
is not surprising, as visual feature representations are designed to
vary smoothly w.r.t. small changes in the image. We have implemented
this version of the HOG descriptor using the Python-based
auto-differentiation package Chumpy [18], which evaluates an expression
and its derivatives with respect to its inputs. The package and our
extension integrate with the recently proposed Approximate
Differentiable Renderer OpenDR [19]. We will make our implementation
publicly available in the near future.


4. Experimental Results

4.1. Reconstruction from HOG descriptors

We evaluate our proposed VHOG method on the task of image reconstruction
from feature descriptors. We are interested in this task since it
provides a way to visualize the information carried by the feature
descriptors and opens up the opportunity to examine the feature
descriptor itself, instead of relying on performance measures of certain
tasks as proxies. There is already prior work on this problem:
[13, 28, 5] focus on different feature representations such as
Bag-of-Visual-Words (BoVW), Histogram of Oriented Gradients (HOG), and
Local Binary Descriptors (LBDs). However, state-of-the-art approaches
typically need large-scale image databases for learning the
reconstruction.

Objective: As we have derived the gradient of the HOG feature w.r.t. the
input, we can, given a feature vector, directly optimize for the
reconstruction of the original image without any additional data. To
define the problem more formally, let $I \in \mathbb{R}^{X \times Y}$ be
an image and $\phi(I)$ its HOG representation. We optimize to find the
reconstructed image $\hat{I}$ whose HOG features $\phi(\hat{I})$ have
the minimum Euclidean distance $E$ to $\phi(I)$:

$$\hat{I} = \operatorname*{argmin}_{\hat{I} \in \mathbb{R}^{X \times Y}} E = \operatorname*{argmin}_{\hat{I} \in \mathbb{R}^{X \times Y}} \|\phi(I) - \phi(\hat{I})\| \qquad (8)$$


The option to approach the problem in this way was mentioned in [28];
however, no result was achieved, as numerical differentiation is
computationally very expensive in this setting. Direct optimization is
facilitated for the first time by our VHOG approach.

An overview of our approach is shown in Figure 1. We compute the
derivatives of $E$ with respect to the intensity values $i_{x,y}$ at all
pixel positions $(x, y)$ of $\hat{I}$ via auto-differentiation. By
gradient-based optimization we are able to find a local minimum of $E$
and the corresponding reconstructed image $\hat{I}$. In order to
regularize our estimate, we introduce a smoothness prior that penalizes
gray value changes between adjacent pixels. Intuitively, this encourages
propagation of information into areas without strong edges, for which no
signal from the HOG features is available.


$$\hat{I} = \operatorname{argmin} \|\phi(I) - \phi(\hat{I})\| + \lambda \sum_{p, q \in \mathcal{N}} \|i_p - i_q\| \qquad (9)$$

where $p, q \in \mathcal{N}$ means that pixels $p$ and $q$ are
neighbors, and $\lambda$ is the weight of the smoothness term, which we
usually set to a large value such as 10^ in our experiments. Although we
give a high weight to the smoothness term, it only plays a key role in
the first few iterations of the optimization procedure; afterwards the
Euclidean distance $E$ dominates in finding the local minimum.
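The regularized objective can be sketched as a data term plus a smoothness prior over 4-connected neighbors. The squared form of the neighbor penalty, the function names, and the generic `features` callable (standing in for the HOG mapping) are assumptions of this sketch and are not taken from our implementation.

```python
import numpy as np

def smoothness(image):
    """Sum of squared differences between horizontally and vertically
    adjacent pixels (squared penalty is an assumption)."""
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    return float((dx ** 2).sum() + (dy ** 2).sum())

def smoothness_grad(image):
    """Analytic gradient of `smoothness` with respect to every pixel."""
    dx = image[:, 1:] - image[:, :-1]
    dy = image[1:, :] - image[:-1, :]
    grad = np.zeros_like(image, dtype=float)
    grad[:, 1:] += 2.0 * dx
    grad[:, :-1] -= 2.0 * dx
    grad[1:, :] += 2.0 * dy
    grad[:-1, :] -= 2.0 * dy
    return grad

def energy(candidate, target_features, features, weight):
    """Euclidean data term plus weighted smoothness prior."""
    data = np.linalg.norm(target_features - features(candidate))
    return data + weight * smoothness(candidate)
```

The prior has a nonzero gradient wherever neighboring pixels differ, so it provides a signal even in flat regions where the descriptor term carries none.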

The evaluation is based on the image reconstruction dataset proposed in
[13], which contains 101 images from all categories of the Caltech 101
dataset [8], all at a resolution of 128 x 128. We compare our method
with a few state-of-the-art baselines for image reconstruction from
feature descriptors: the BoVW method from [13], the HOGgles method from
[28], and CNN-HOG and CNN-HOGb (CNN-HOG with bilinear orientation
assignments) from [22].

Note that our VHOG described in Section 3 is based on Dalal-type HOG
[4], while the HOGgles, CNN-HOG, and CNN-HOGb baselines use UoCTTI-type
HOG [9], which additionally considers directed gradients. For a fair
comparison, we also implement UoCTTI HOG within our proposed framework.

We propose two additional variants for reconstruction 
that exploit multi-scale information as shown in Figure 3. 


VHOG multi-scale: We use the single-scale HOG descriptor as input, but
we first reconstruct $\hat{I}_1$ at a resolution $s$ times smaller than
that of $I$ (the cell size for $\phi(\hat{I}_1)$ is scaled down
accordingly relative to the one used for $\phi(I)$;
$s \in \{4, 16, 64\}$ in our experimental setting). After a few
iterations of updates in the optimization process, we up-sample
$\hat{I}_1$ to a higher resolution and continue the reconstruction
procedure. These steps are repeated until we reach the original
resolution of $I$.
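The coarse-to-fine schedule can be sketched as a loop that alternates refinement and up-sampling. This is only an outline: `refine` is a hypothetical callable standing in for a few optimization iterations at one scale, and nearest-neighbor up-sampling by a factor of 2 per side as well as the constant initial estimate are assumptions.

```python
import numpy as np

def upsample2(image):
    """Nearest-neighbor up-sampling by a factor of 2 per side (assumption)."""
    return np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)

def coarse_to_fine(shape, levels, refine):
    """Start `levels` octaves below `shape`, refine at each scale, and
    up-sample until the full resolution is reached."""
    height, width = shape[0] >> levels, shape[1] >> levels
    estimate = np.full((height, width), 0.5)  # flat gray start (assumption)
    for level in range(levels, -1, -1):
        estimate = refine(estimate, level)    # a few optimizer updates
        if level > 0:
            estimate = upsample2(estimate)
    return estimate
```

For a 128 x 128 target and three levels, the estimate passes through 16, 32, 64, and 128 pixels per side.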




Figure 3: Visualizations of variants of the proposed method for the task
of image reconstruction from feature descriptors: (a) VHOG multi-scale,
(b) VHOG multi-scale-more.


The optimization is carried out with Powell's dogleg method [20] for
nonlinear optimization, as implemented in Chumpy [18]. Example results
of the multi-scale approaches can be seen in Table 1.



Table 1: Example results for (a), (b) VHOG multi-scale and VHOG
multi-scale-more, both based on Dalal-HOG [4], and (c) VHOG on
UoCTTI-HOG [9].


VHOG multi-scale-more: We use the multi-scale HOG vectors of the
original image $I$ as the input. For the reconstruction at each scale
$s$, the HOG descriptor extracted at the same scale is used in the
Euclidean distance $E$, as shown in Figure 3(b). As additional HOG
descriptors are computed from the original image at different scales, we
use more information than in the original setup, and therefore the
results of the "multi-scale-more" approach cannot be directly compared
to prior work.


Results: In order to quantify the performance of image reconstruction,
different metrics have been proposed in prior work. For instance, [13]
uses the mean squared error of raw pixels, while [28] chooses the
cross-correlation to compare the similarity between the reconstructed
image and the original one. In addition to using cross-correlation as
the metric for quantitative evaluation, we also investigate choices used
in research on image quality assessment (IQA), including mutual
information and Structural Similarity (SSIM) [29]. In particular, mutual
information measures the mutual dependency between images and hence
gives another similarity metric, while SSIM measures the degradation of
structural information in the distorted/reconstructed image relative to
the original one, under the assumption that human visual perception is
adapted to discriminate structural information in images.
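Two of these metrics can be sketched in a few lines of NumPy. The exact definitions used for the reported numbers are not spelled out here, so the zero-mean normalized form of the cross-correlation, the histogram-based estimate of mutual information, and the number of bins are all assumptions of this example.

```python
import numpy as np

def cross_correlation(a, b):
    """Zero-mean normalized cross-correlation of two images, in [-1, 1]."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def mutual_information(a, b, bins=16):
    """Mutual information (in nats) from a joint gray-value histogram."""
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nonzero = p_ab > 0
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float((p_ab[nonzero] * np.log(ratio)).sum())
```

Cross-correlation is invariant to affine changes of brightness, while mutual information only requires a consistent mapping between gray values, which is why the two can rank reconstructions differently.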

We report the performance numbers for all metrics in Table 2. The
proposed method using UoCTTI-type HOG outperforms the state-of-the-art
baselines by a large margin on all metrics. On visual inspection, our
proposed method reconstructs many details in the images and also gives
accurate estimates of gray-scale values when using UoCTTI HOG. Note
again that our method does not need any additional data for training,
while training is necessary for the baselines.

4.2. Pose estimation 

We also evaluate our VHOG approach on a pose estimation task in which 3D
CAD models have to be aligned to objects in 2D images. We build on
OpenDR [19], an approximate differentiable renderer. It parameterizes
the forward graphics model $f$ based on vertex locations $V$, per-vertex
brightness $A$, and camera parameters $C$, as shown in the left part of
Figure 5, where $U$ denotes the 2D projected vertex coordinates. Based
on auto-differentiation techniques, OpenDR provides a way to obtain the
derivatives of the rendered image observation with respect to the
parameters of the rendering pipeline.

Approach: We extend OpenDR in the following ways, as illustrated in
Figure 5: 1) We parameterize the vertex locations $V$ of the CAD models
by three parameters: azimuth $\theta$, elevation, and distance to the
camera $\gamma$. 2) During pose estimation, as in [1], the matching
between objects in real images and rendered images of the CAD models is
based on the similarities between the HOG descriptors of the visual
discriminative elements extracted from them. The detailed procedure for
extracting visual discriminative elements is discussed in [1]. In our
method, we compute the VHOG descriptors $\phi(P_{\hat{I}})$ of the
patches $P_{\hat{I}}$ of the rendered image that cover the same regions
as the visual elements $P_I$ extracted from the test image $I$; the
similarity between $P_{\hat{I}}$ and $P_I$ is the dot product of their
HOG descriptors $\phi(P_{\hat{I}})$ and $\phi(P_I)$. As shown in the
right part of Figure 5, this similarity can be traced back to the pose
parameters, and the derivatives of the similarity with respect to the
pose parameters can again be computed by auto-differentiation. Our
method can therefore directly maximize the similarity to estimate the
pose.
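Written out for the azimuth, the chain of derivatives behind this optimization looks roughly as follows. This is a sketch in the notation of this section: $S$ is a name introduced here for the patch similarity, $P_{\hat{I}}$ is a patch cropped from the rendered image $f$, and the factorization simply follows the order of the rendering and feature stages.

```latex
S = \phi(P_I)^\top \phi(P_{\hat{I}}), \qquad
\frac{\partial S}{\partial \theta}
  = \underbrace{\phi(P_I)^\top}_{\partial S / \partial \phi(P_{\hat{I}})}
    \,\frac{\partial \phi(P_{\hat{I}})}{\partial f}
    \,\frac{\partial f}{\partial U}
    \,\frac{\partial U}{\partial V}
    \,\frac{\partial V}{\partial \theta}
```

The first two factors come from the differentiable HOG pipeline, the next two from the differentiable renderer, and the last from the pose parameterization of the vertices. The derivative with respect to the camera distance $\gamma$ has the same structure with $\partial V / \partial \gamma$ as the last factor.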

Setup: We follow the same experimental setting as [1] and test on the
images annotated as no-occlusion, no-truncation, and not-difficult in
the chairs validation set of the PASCAL VOC 2012 dataset [7]; in total,
247 chairs from 179 images are used for the evaluation. To focus purely
on the evaluation of pose estimation, we extract the object images based
on their bounding box annotations and resize them so that the shortest
side is at least 100 pixels long.

Figure 4: Visualization of the similarity and its gradients w.r.t.
azimuth. The red boxes on the HOG representations are the visual
discriminative patches.

The baseline [1] is applied to the chair images to search over a chair
CAD database of 1393 models, each rendered from 62 different poses
relative to the camera, in order to detect the chairs, match their
styles, and simultaneously recover their poses based on the rendered
images. We select the most confident detection for each chair together
with the estimated pose.

We apply our proposed method to pose estimation by using the elevation
and azimuth estimates of [1] as an initialization of the pose, and we
add a few more initializations for the azimuth (8, equidistantly
distributed over 360°). We use gradient descent with a momentum term to
optimize the azimuth parameter and interleave iterations in which we
additionally optimize the distance to the camera. In Figure 4 we
visualize an example of the similarity between the chair in the real
image and the CAD model in the rendered image, as well as its gradients
w.r.t. the azimuth $\theta$ (over the full 360°). We can see how the
gradients change in relation to the different local maxima and the
corresponding poses of the CAD model.
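The optimization over azimuth can be illustrated on a toy problem. The sketch below runs gradient ascent with momentum from 8 equidistant starting angles and keeps the best result. The similarity function `toy_score`, its gradient, and all step-size settings are invented for illustration; in the actual pipeline the score and its gradient would come from the renderer and the HOG descriptor.

```python
import math

def toy_score(theta):
    """Hypothetical similarity with several local maxima; global maximum at 100 degrees."""
    d = math.radians(theta - 100.0)
    return math.cos(d) + 0.8 * math.cos(4.0 * d)

def toy_score_grad(theta):
    """Derivative of `toy_score` with respect to theta (in degrees)."""
    d = math.radians(theta - 100.0)
    return -(math.pi / 180.0) * (math.sin(d) + 3.2 * math.sin(4.0 * d))

def ascend(score_grad, theta0, lr=50.0, momentum=0.9, steps=300):
    """Gradient ascent with a momentum term on a circular parameter."""
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity + lr * score_grad(theta)
        theta = (theta + velocity) % 360.0
    return theta

def best_azimuth(score, score_grad, num_inits=8):
    """Run the optimizer from equidistant azimuths and keep the best optimum."""
    starts = [k * 360.0 / num_inits for k in range(num_inits)]
    candidates = [ascend(score_grad, start) for start in starts]
    return max(candidates, key=score)
```

Calling `best_azimuth(toy_score, toy_score_grad)` returns an angle close to 100°. Individual runs may end in other local maxima, which is why several initializations are combined.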














HOG type          Method                                        Cross correlation   Mutual information   Structural similarity (SSIM) [29]
-                 BoVW [13]                                     0.287               1.182                0.252
UoCTTI HOG [9]    HOGgles [28]                                  0.409               1.497                0.271
UoCTTI HOG [9]    CNN-HOG [22]                                  0.632               1.211                0.381
UoCTTI HOG [9]    CNN-HOGb [22]                                 0.657               1.597                0.387
UoCTTI HOG [9]    our VHOG (single scale)                       0.760               1.908                0.433
Dalal's HOG [4]   our VHOG (single scale)                       0.170               1.464                0.301
Dalal's HOG [4]   our VHOG (multi-scale, scale 1 of 4)          0.058               1.444                0.121
Dalal's HOG [4]   our VHOG (multi-scale, scale 2 of 4)          0.076               1.470                0.147
Dalal's HOG [4]   our VHOG (multi-scale, scale 3 of 4)          0.108               1.458                0.221
Dalal's HOG [4]   our VHOG (multi-scale, scale 4 of 4)          0.147               1.478                0.293
Dalal's HOG [4]   our VHOG (multi-scale-more, scale 1 of 4)     0.147               1.458                0.251
Dalal's HOG [4]   our VHOG (multi-scale-more, scale 2 of 4)     0.191               1.502                0.291
Dalal's HOG [4]   our VHOG (multi-scale-more, scale 3 of 4)     0.220               1.565                0.320
Dalal's HOG [4]   our VHOG (multi-scale-more, scale 4 of 4)     0.236               1.582                0.338

Table 2: Comparison of the performance of reconstruction from feature descriptors.
Figure 5: (left) The differentiable rendering procedure from OpenDR
[19]. (right) The visualization of our model for pose estimation.


Results: In order to quantify our performance on the pose estimation
task, we use the continuous 3D pose annotations from the PASCAL3D+
dataset [31]. Following the same evaluation scheme, a viewpoint estimate
is considered correct if its estimated viewpoint label falls within the
same interval of the discrete viewpoint space as the ground-truth
annotation, or if its difference from the ground truth in continuous
viewpoint space is below a threshold. We evaluate the performance for
various settings of the intervals and thresholds in viewpoint space:
{4 views/90°, 8 views/45°, 16 views/22.5°, 24 views/15°}. In Table 3 we
report the performance numbers for Aubry's baseline and our proposed
approach. We outperform the baseline in all settings, by 11.5 percentage
points on the coarsest measure (4 views) and by 1.6 percentage points on
the finest (24 views). Some example results that show improvements over
the baseline method are shown in Table 4.
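The correctness criterion can be written down compactly. In this sketch the threshold is taken to be the width of one viewpoint interval (360° divided by the number of views, matching the listed settings), and the alignment of the first interval at 0° is an assumption; the function names are illustrative.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two azimuths, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def same_view_bin(estimate, truth, num_views):
    """True if both azimuths fall into the same discrete viewpoint interval."""
    width = 360.0 / num_views
    return int((estimate % 360.0) // width) == int((truth % 360.0) // width)

def is_correct(estimate, truth, num_views):
    """Correct if in the same interval or closer than the threshold."""
    threshold = 360.0 / num_views  # e.g. 8 views -> 45 degrees
    return (same_view_bin(estimate, truth, num_views)
            or angular_difference(estimate, truth) < threshold)
```

For example, an estimate of 350° against a ground truth of 10° counts as correct for 8 views, since the two angles are only 20° apart across the wrap-around.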

Discussion: One advantage of our proposed method is that we are able to
parameterize the vertex coordinates of the CAD models by the same pose
parameters as used in [1]. The differentiable rendering procedure
provided by OpenDR [19] and our VHOG representation then enable us to
directly compute the derivatives of the similarity with respect to the
pose parameters and to optimize for continuous pose parameters. In other
words, the proposed approach neither needs to discretize the parameters
as in [1] nor to render images from many poses in advance for the
alignment procedure.



Method              4 views   8 views   16 views   24 views
Aubry et al. [1]    47.33     35.39     20.16      15.23
our method          58.85     40.74     22.22      16.87

Table 3: Pose estimation results based on the ground-truth annotations
from PASCAL3D+ [31].


5. Conclusions 

We investigate the feature extraction pipeline of the HOG descriptor and
exploit its piecewise differentiability. Based on an implementation
using auto-differentiation techniques, the derivatives of the HOG
representation can be computed directly. We study two problems, image
reconstruction from HOG features and HOG-based pose estimation, for
which direct end-to-end optimization becomes practical with our VHOG. We
demonstrate that our VHOG-based approaches outperform the
state-of-the-art baselines on both problems. The approach leads to
improved introspection via visualizations and improved performance via
direct optimization through a whole vision pipeline. Our implementation
is integrated into an existing auto-differentiation package as well as
the recently proposed Approximate Differentiable Renderer OpenDR [19]
and is publicly available. Therefore it is easy to adapt to new tasks
and is applicable to a range of end-to-end optimization problems.

Table 4: Example results for pose estimation.

Table 5: Example results for image reconstruction from feature
descriptors. Columns: our VHOG (single-scale) on UoCTTI-HOG; our VHOG
(single-scale), (multi-scale), and (multi-scale-more) on Dalal-HOG.